Welcome to Kuan-I (Brian) Lu’s data science portfolio page! Here, I showcase some of the notable projects from my journey through the data science world across many disciplines. The subjects range from Machine Learning to Time Series, and from topics as involved as Signal Processing to simpler Linear Regression and Data Cleaning exercises. Enjoy!


Machine Learning: Predicting Success of Startups

During my internship at the Industrial Technology Research Institute (ITRI) in Taiwan, I had access to a dataset containing information on Taiwanese startup companies from their earliest records through 2023. From the perspective of an investor or an entrepreneur, the ability to predict whether a new company will succeed is crucial. With the right insight and analysis, investors can know which companies to pay attention to, and entrepreneurs can know which aspects to focus on improving. However, most industry analysts rely on qualitative judgment, domain knowledge, and experience to determine which companies have better potential to succeed. The purpose of this project is to tackle the problem from a quantitative perspective, serving as a tool to support or challenge decisions and insights made by the industry analysts at ITRI.

In this project, the main obstacle was the enormous number of missing values, a consequence of the limited transparency of startup companies. To deal with this issue, we had to impute some of the missing values creatively, filling them based on the industries each company operates in and the average values within those industries. Just as important as imputing the missing values was the selection of the “useful” variables. Thankfully, with the professional guidance and insight of my mentor during my internship at ITRI, we reached satisfying results. To view the results and further details about this machine learning project, click on this link: Predicting Success of Startups
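As a sketch of the industry-average imputation idea (the column names below are hypothetical, not the actual ITRI schema), a pandas `groupby`/`transform` can fill each gap with the mean of the company’s industry:

```python
import numpy as np
import pandas as pd

# Toy startup table; "industry" and "revenue" are illustrative column names.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E"],
    "industry": ["biotech", "biotech", "fintech", "fintech", "fintech"],
    "revenue": [10.0, np.nan, 4.0, 8.0, np.nan],
})

# Fill each missing value with the average of the company's industry.
df["revenue"] = df["revenue"].fillna(
    df.groupby("industry")["revenue"].transform("mean")
)
```

The `transform("mean")` call broadcasts each industry’s average back to every row, so `fillna` only touches the missing entries while observed values stay intact.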


Natural Language Processing: Book Review Analysis with BERT

This project is the culmination of a series of courses on Natural Language Processing (NLP) with multiple Large Language Models (LLMs). My team and I were tasked with predicting a rating’s score (integers from 1 to 5) from the rating’s text, with a sample size of 3 million reviews. Among the models we considered, we decided to approach the problem by fine-tuning a DistilBERT model, given the constraints of our environment and the difficulty of the task. We collectively did extensive research understanding and comparing BERT with other models (LSTM, ELMo, etc.), as well as exploring the mechanism and advantages of DistilBERT over other BERT variants. Once fully prepared, we started the experiment, in which I personally did most of the heavy lifting in coding and trial and error. I also made the flowchart above to better illustrate the model structure used in this project.
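As a minimal sketch of the classification setup, assume a DistilBERT-style encoder whose pooled 768-dimensional output feeds a 5-way head, with the 1–5 star ratings shifted to class ids 0–4 for cross-entropy. The tensors below are random stand-ins, not the real encoder or data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# DistilBERT's hidden size is 768; the head maps it to the 5 rating classes.
hidden_size, num_classes = 768, 5
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_classes))

pooled = torch.randn(8, hidden_size)    # stand-in for the encoder's pooled output
ratings = torch.randint(1, 6, (8,))     # raw star ratings, 1..5
labels = ratings - 1                    # class ids 0..4 for cross-entropy

logits = head(pooled)                   # shape: (8, 5)
loss = nn.CrossEntropyLoss()(logits, labels)
pred_ratings = logits.argmax(dim=1) + 1 # map predictions back to 1..5
```

In fine-tuning, the same loss is backpropagated through both the head and the (partially unfrozen) encoder; only the head is shown here.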

The greatest challenge of this project was the computational burden. Even though DistilBERT is a lighter version of BERT base, and we adopted other strategies (downsampling, layer freezing, etc.) to relieve the amount of computational work, we still had to wait a long time to fit each model. Working on Google Colab, we had to purchase compute units multiple times for trial and error on model structures, which was not only monetarily demanding but also extremely time consuming. Gladly, we ended up with successful and desirable results, with the final model taking almost 8 hours to run. For more details on both the research and the experiment sections, click here: Book Review Analysis with BERT
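Of the strategies above, downsampling is easy to sketch: balance the rating classes by sampling each down to the size of the rarest class (toy data below, not the actual 3-million-review set):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy imbalanced review set; real star ratings skew heavily toward 4-5.
reviews = pd.DataFrame({
    "rating": rng.choice([1, 2, 3, 4, 5], size=1000,
                         p=[0.05, 0.05, 0.1, 0.3, 0.5]),
})

# Downsample every class to the size of the rarest one.
n_min = reviews["rating"].value_counts().min()
balanced = reviews.groupby("rating").sample(n=n_min, random_state=0)
```

This trades data volume for a balanced label distribution, which both shortens training and keeps the model from defaulting to the majority class.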


Research: Evolution of Machine Learning in Financial Risk Management: A Survey

After learning about the underlying mechanisms of neural networks and their applications in Natural Language Processing, I wanted to combine what I had learned with my domain knowledge of finance, which led to this research project. Financial Risk Management (FRM) can be a very data-heavy task: from fraud detection to predicting market risk, much unnecessary cost can be avoided if we successfully identify financial risks by leveraging the data we have. This research serves as a guideline and handbook for those in financial risk management seeking to understand the more data-driven side of the task. We examined the evolution of machine learning tools used in FRM: from simple Bayesian inference for finding the best parameters of time series models that explain VaR trends, to neural networks applied to a wider variety of tasks, to reinforcement learning that builds on neural network outputs to assist decisions typically made by humans. By providing details of the methods, applications, and related experimental results, we offer a comprehensive view of the application of machine learning to FRM tasks throughout the years.

Most stages of this research went smoothly, which is no surprise since most of the learning happened in the NLP course, where I studied the mechanisms of neural networks. However, finding adequate and reliable sources demonstrating the effectiveness of the selected methods did consume much of my energy and time. In addition, I independently learned advanced reinforcement learning concepts and their applications solely from previous papers. This was an interesting process, and I really enjoyed extracting knowledge by digging through prior scholarly work. To read the complete version of the research paper, please click the following link: Evolution of machine learning in financial risk management: A survey

For additional information, this paper is currently under review at scientific conferences and is expected to be published within several months.


Research: Compressed Sensing and Basis Pursuit

Compressed sensing and signal decomposition are an essential part of modern data science. This research paper is based on the ideas in “Atomic Decomposition by Basis Pursuit”, which compared four approaches to reconstructing signals from sparse data: Method of Frames, Matching Pursuit, Best Orthogonal Basis, and Basis Pursuit. This project examined the performance of these four methods and concluded that Basis Pursuit is indeed the most consistent and efficient. Extensive mathematical proof is provided to support the validity of the algorithm on which Basis Pursuit is based.
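Basis Pursuit solves min ‖x‖₁ subject to Ax = b, which can be recast as a linear program by splitting x into nonnegative parts x = u − v. A minimal sketch on synthetic data (not the paper’s experiments) using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Underdetermined system: 20 measurements of a 40-dim, 3-sparse signal.
m, n = 20, 40
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 17, 31]] = [1.5, -2.0, 0.7]
b = A @ x_true

# Split x = u - v with u, v >= 0, so ||x||_1 = sum(u + v) and the
# constraint Ax = b becomes A u - A v = b: a standard LP.
c = np.ones(2 * n)              # objective: sum(u) + sum(v)
A_eq = np.hstack([A, -A])       # equality constraint A(u - v) = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]   # recombine into the recovered signal
```

With enough random measurements relative to the sparsity level, the ℓ₁ minimizer typically coincides with the sparse signal itself, which is the phenomenon the paper’s comparison builds on.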

As an undergraduate student, my largest obstacle in this project was understanding an advanced topic with little prior domain knowledge. This project demanded more self-learning and information mining than any other project I have done, and I really leaned on strategies to make sure I efficiently absorbed the relevant knowledge from reliable sources. After understanding the necessary material, I further enhanced my knowledge by explaining the ideas with mathematical proofs and laying out pseudocode, showing that I am capable of clearly explaining the ideas behind signal processing and Basis Pursuit. The details of the full research can be found here: Compressed Sensing and Basis Pursuit


Recommender Systems with Deep Learning: Matrix Factorization

Recommender system algorithms differ significantly from traditional supervised machine learning approaches. Building upon a previous project, this work explores the application of deep learning methods in training a recommendation model. Starting from a collaborative filtering algorithm, this project addresses challenges such as the size and sparsity of the user-item matrix, aiming to train latent user and item matrices to estimate ratings. As a baseline, we implemented Singular Value Decomposition (SVD) to benchmark the performance of the proposed model. This project is presented as a vignette, designed to efficiently and effectively convey the core ideas to readers.
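A minimal sketch of the matrix factorization idea, on toy data rather than the Amazon ratings: learn latent user and item embeddings whose dot product estimates the observed rating, trained by gradient descent on mean squared error:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_users, n_items, k = 30, 40, 8    # k latent factors (illustrative sizes)

# Latent user and item matrices, learned from the sparse observed entries.
user_f = nn.Embedding(n_users, k)
item_f = nn.Embedding(n_items, k)

# Toy sparse observations: (user, item, rating) triples.
users = torch.randint(0, n_users, (200,))
items = torch.randint(0, n_items, (200,))
ratings = torch.rand(200) * 4 + 1  # ratings in [1, 5]

opt = torch.optim.Adam(
    list(user_f.parameters()) + list(item_f.parameters()), lr=0.05
)
loss_fn = nn.MSELoss()

losses = []
for _ in range(200):
    opt.zero_grad()
    # Estimated rating = dot product of the user and item latent vectors.
    pred = (user_f(users) * item_f(items)).sum(dim=1)
    loss = loss_fn(pred, ratings)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Only observed (user, item) pairs enter the loss, which is what lets the approach cope with the sparsity of the full user-item matrix.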

The project was developed in Python using packages such as PyTorch, NumPy, and pandas. The dataset utilized was the Amazon Electronics Product Ratings dataset from Kaggle, focusing exclusively on user-item ratings. Results revealed that the baseline SVD model performed comparably to the deep learning approach. These findings suggest a potential direction for enhancing the recommender system, such as combining deep learning methods with other approaches and predicting ratings through a weighted average of multiple models. For more details about the project, please visit the following link: Recommender Systems with Deep Learning: Matrix Factorization


Recommender Systems with Collaborative Filtering: Recommending Restaurants

Recommender systems play a pivotal role in shaping our daily experiences, influencing what we see, explore, and consume. This project delves into recommender system algorithms, focusing on the widely used item-based collaborative filtering method. Using the Yelp Dataset, we constructed two models designed to recommend multiple restaurants based on either a given restaurant or a user. The first model follows a straightforward algorithmic pipeline: given a restaurant as input, the model calculates pairwise cosine similarities with all other restaurants in the dataset and returns the top 10 most similar restaurants. The second model builds upon this structure. When provided with a user, it identifies the user’s top-rated restaurants and applies the same cosine similarity method to generate recommendations for each of these restaurants. The model then filters out previously visited restaurants and refines the list to return the top 10 most similar recommendations.
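The first model’s pipeline can be sketched as follows, with a random toy ratings matrix standing in for the Yelp data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Rows = restaurants, columns = users; entries are ratings (0 = unrated).
# Shapes and values are illustrative, not the actual Yelp pipeline.
ratings = rng.integers(0, 6, size=(50, 200)).astype(float)

def recommend(restaurant_id, ratings, top_n=10):
    """Return the top_n restaurants most similar to restaurant_id."""
    sims = cosine_similarity(ratings)[restaurant_id]
    sims[restaurant_id] = -np.inf     # exclude the query restaurant itself
    return np.argsort(sims)[::-1][:top_n]

recs = recommend(7, ratings)
```

The second, user-based model would call this same routine once per top-rated restaurant of the user, then filter out already-visited places before truncating to ten.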

One key challenge in building recommender systems is evaluating their “success.” In a business context, a model is deemed successful if it satisfies customers by recommending products they appreciate. To assess the performance of our models in this project, we selected a popular restaurant near our school, widely favored by the community. By analyzing the recommendations, we found that the model effectively identified other popular restaurants in the area, indicating its practical utility. This project yielded additional interesting findings. To explore these insights further, please visit the following link: Recommending Restaurants


Machine Learning: Predicting Spotify Hits

We all love listening to music, and oftentimes a song becomes so popular that most of your friends have heard it before you can introduce it to them; these songs are so popular that we can simply assume everyone knows them without checking. These popular songs, or hits, are the hallmark of successful music producers, and many people think luck and the randomness of the market are what create them. But what if there is a clear trend we can trace to predict which songs are more likely to become hits?

In this project, we worked with a dataset containing over one million songs on Spotify, a streaming platform with a global presence, with the intention of predicting whether a song will become a hit. We started with extensive data cleaning, spending much effort on feature engineering: 1) the popularity variable, improving generalizability by classifying songs into three classes (hits, regulars, flops), and 2) the artists variable, grouping artists into tiers to avoid redundant dummy variables. We also fitted and tuned four types of machine learning models, selected the best one, and assessed its performance. For more details, please click on this link to see the whole project: Predicting Spotify Hits
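The two feature engineering steps can be sketched roughly like this; the cut points and tier scheme are illustrative, not the project’s actual thresholds:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

songs = pd.DataFrame({
    "popularity": rng.integers(0, 101, size=500),  # Spotify-style 0-100 score
    "artist": rng.choice([f"artist_{i}" for i in range(40)], size=500),
})

# 1) Bucket the popularity score into three classes.
songs["class"] = pd.cut(
    songs["popularity"], bins=[-1, 33, 66, 100],
    labels=["flop", "regular", "hit"],
)

# 2) Tier artists by their average popularity instead of one dummy per artist.
artist_avg = songs.groupby("artist")["popularity"].transform("mean")
songs["artist_tier"] = pd.qcut(artist_avg, q=4, labels=["D", "C", "B", "A"])
```

Collapsing thousands of artist dummies into a handful of tiers keeps the design matrix small while still letting the models use artist strength as a signal.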


Time Series: SPX Stock Price

Dealing with financial data can be tricky, as the observed values (for example, prices) are often time dependent and cannot be treated as independent events. The S&P 500 is one of the most iconic indices for gauging the overall energy and prosperity of the stock market, or even the whole economic system. If we can somehow explain the trend or the fluctuation of the SPX price, we can get a better grasp of the financial environment and therefore make financial decisions that take these environmental factors into consideration.

This project attempted to explain the weekly SPX price with two time series models: the SARIMA model, which is more general in explaining time series data, and the ARCH/GARCH model, which tackles non-constant variance, a problem especially common in financial time series. We fit both models to the weekly SPX price data and assessed their quality of fit, along with some forecasts and potential directions for improvement. For more detail, please click the link to see the entire project: Time Series and Stock Price
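To illustrate the variance dynamics a GARCH model captures, here is a simulation of the GARCH(1,1) recursion with illustrative parameters (not estimates from the SPX data):

```python
import numpy as np

rng = np.random.default_rng(0)

# GARCH(1,1): sigma2_t = omega + alpha * r_{t-1}^2 + beta * sigma2_{t-1}
omega, alpha, beta = 0.05, 0.1, 0.85   # alpha + beta < 1 -> stationary
T = 1000
r = np.zeros(T)                        # simulated returns
sigma2 = np.zeros(T)                   # conditional variances
sigma2[0] = omega / (1 - alpha - beta) # start at the unconditional variance

for t in range(1, T):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
```

A large return feeds back into the next period’s variance, producing the volatility clustering (calm stretches punctuated by turbulent ones) that constant-variance models like plain SARIMA cannot express.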


Linear Regression: NBA Salary Inference

As some of the highest-paid athletes in the world, NBA players have seen their salaries grow at a pace unprecedented in almost any job market. This phenomenon has drawn the attention of scholars studying the relations, causation, and correlations between NBA salaries and different attributes, from a team’s location to the statistics a player puts up over their career. This project being my very first, I focused more on data analysis, data cleaning, and visualization to draw inferences about how different variables relate to salary, rather than on finding the best model for a prediction task. I used a simple linear regression model to explore the relations among variables. Although the outline of this project might seem simple, I did go to the extent of feature engineering, creating columns denoting the moving averages of a player’s scoring- and effectiveness-related statistics, which I found to be positively correlated with the salary the player receives the following year. For the details of the project, click this link: NBA Salary Linear Regression
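The moving-average feature can be sketched with a per-player rolling mean in pandas (toy numbers and hypothetical column names, not the actual dataset):

```python
import pandas as pd

# Toy per-season stats; column names are illustrative.
stats = pd.DataFrame({
    "player": ["James"] * 4 + ["Curry"] * 3,
    "season": [2018, 2019, 2020, 2021, 2018, 2019, 2020],
    "pts_per_game": [27.5, 25.3, 25.0, 30.3, 26.4, 20.8, 32.0],
})

# 3-season moving average of scoring, computed within each player.
stats["pts_ma3"] = (
    stats.sort_values("season")
         .groupby("player")["pts_per_game"]
         .transform(lambda s: s.rolling(3, min_periods=1).mean())
)

# Shift by one season so the feature lines up with the *following*
# year's salary, matching how the correlation is framed above.
stats["pts_ma3_lag"] = stats.groupby("player")["pts_ma3"].shift(1)
```

Grouping before rolling keeps one player’s seasons from bleeding into another’s, and the one-season shift prevents the feature from leaking same-year information into the salary inference.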